Abstract

We present a theoretical analysis and empirical evaluation of a novel set of
techniques for reducing the computational cost of classifiers that are based on
a learned transform and soft-threshold. By modifying the optimization
procedures for dictionary and classifier training, as well as the resulting
dictionary entries, our techniques reduce the bit precision and replace each
floating-point multiplication by a single integer bit shift. We also show how
the optimization algorithms in some dictionary training methods can be modified
to penalize higher-energy dictionaries. We applied our techniques to the
classifier Learning Algorithm for Soft-Thresholding, testing on the datasets
used in its original paper. Our results indicate that it is feasible to use
solely sums and bit shifts of integers to classify at test time with a limited
reduction of the classification accuracy. These low-power operations are a
valuable trade-off in FPGA implementations, as they increase the classification
throughput while decreasing both energy consumption and manufacturing cost.
1. Introduction

In image classification, feature extraction is an important step, especially in
domains where the training set has a high-dimensional space that requires more
processing and memory resources. A recent trend in feature extraction for image
classification is the construction of sparse features, which consist of the
representation of the signal in an overcomplete dictionary. When the dictionary
is learned specifically for the input dataset, the classification of sparse
features can achieve results comparable to state-of-the-art classification
algorithms [1]. However, this approach has a drawback at test time, as the
sparse coding of the input test sample is computationally intensive, making it
impractical for embedded applications that have scarce computational and power
resources.

A recent approach to this drawback is to learn a sparsifying transform from
the target image dataset [?]. The learned classifier then has an architecture
that can be seen as a feedforward neural network (FFNN) with one hidden layer
and no bias. At test time, this approach reduces the sparse coding of the input
image to a simple matrix-vector multiplication followed by a soft-threshold,
which can be efficiently realized in hardware due to its inherently parallel
nature. Nevertheless, these matrix-vector multiplications require
floating-point operations, which may have a high cost in hardware, especially
in FPGAs, as they increase the fabrication cost and demand more energy to
operate.

Exploring some properties we derive from these classifiers, we propose a
set of techniques to reduce their computational cost at test time, which we
divide into four main groups: (i) decrease the dynamic range of the dictionary,
first by penalizing the $\ell_2$ norm of its entries during training, then by
zeroing out the entries whose absolute values are smaller than a trained
threshold; (ii) use test images in integer format (the format in which they are
sampled by analog-to-digital converters, ADCs) instead of their scaled,
normalized floating-point version, thus replacing costly floating-point
operations by integer operations, which are cheaper to implement in hardware
and do not affect the classification accuracy; (iii) quantize the
integer-valued test images, thus decreasing the number of bits needed to
represent them; and (iv) quantize both the transform dictionary and the
classifier by approximating their entries to the nearest power of 2, thus
replacing each multiplication by a simple bit shift.

From now on, we refer to this set of techniques as xQuant. As a case study
for xQuant, we use a recent classification algorithm named Learning Algorithm
for Soft-Thresholding classifier (LAST), which learns both the sparse
representation of the signals and the hyperplane vector classifier at the same
time. Our tests use the same datasets used in the paper that introduces LAST,
and our results indicate that our techniques reduce the computational cost
without substantially degrading the classification accuracy. Moreover, on one
of the datasets we tested, our techniques substantially increased the
classification accuracy.

To the best of our knowledge, this paper presents the first generic approach
to reduce the computational cost at test time of classifiers that are based on
a learned transform. This has a valuable application in embedded systems where
power consumption is critical and computational power is restricted.
Furthermore, xQuant removes the need for DSPs to perform intensive
matrix-vector operations in FPGA architectures for image classification,
lowering the overall manufacturing cost of embedded systems.

Even though all simulations we ran to test our techniques were performed on
image classification using LAST, our proposed techniques are sufficiently
general to be applied to different problems and different classification
algorithms that use matrix-vector multiplications to extract features, such as
Extreme Learning Machine (ELM) and Deep Neural Networks (DNN) [5].

2. Related Work 

The literature on reducing the computational cost of classifiers is vast, and
thus we only present some of the significant trends. It is worth noting that
quantization strategies to reduce the resource usage of FFNN classifiers
implemented in FPGAs are not new and have been used successfully since the past
century. In [7], for example, a quantization scheme is proposed to eliminate
all multiplications at test time. After training the parameters of a
feedforward neural network, the authors approximate these parameters to powers
of two and retrain the network, letting only the bias values change freely in
the real domain, as these biases do not participate in multiplications. This
reduces each multiplication to a single bit shift operation. The problem with
this approach is that it still relies on floating-point operations, which are
costly in applications with limited energy and/or small computational power.

In [8], [9], and [10], different quantization strategies are presented to allow
the use of fixed-point values during training and test time. These works lack
the power-reducing benefits of quantization schemes that approximate the
network parameters to powers of two, as in [7]. This feature was probably
unknown to the authors. In [?], the authors start to experiment with
quantization schemes that allow a higher computational cost reduction. They
quantize the network parameters to have only -1s and 1s, reducing
multiplications to simple sign changes with only a small decrease of the
classification accuracy. [?] and [?] follow the same lead. This quantization
scheme is drastic and eliminates all multiplications and bit shifts at test
time, but may substantially reduce the learning capacity of the neural network.
In [?], the authors propose a post-processing scheme to approximate both the
trained parameters of a CNN and the input images to -1s and 1s. This approach
allows the convolutions to be estimated by XNOR and bit-counting operations.
Nevertheless, this oversimplification comes at the price of a higher
degradation of the classification accuracy compared to the original classifier.

Our approach differs from the aforementioned ones in several points. First, it
can be easily adapted to any learning algorithm, as it does not rely on a
specific one, and thus can be used with different network architectures and
different numbers of neurons. xQuant can also be applied after training the
classifier. Second, it drops all floating-point operations in favor of integer
ones. This avoids the costly normalization and denormalization techniques
required in floating-point operations. Third, it has an optional strategy to
reduce the dynamic range of the parameters during training and consequently
reduce the number of bits necessary to store them. This strategy penalizes
parameter values that cause an increase in the dynamic range by forcing them to
be closer to their average. Fourth, xQuant does not hurt the classification
accuracy as much as the approximation to -1s and 1s performed in some of the
previously mentioned works.


3. Overview of Sparse Representation Classification 

In this section, we briefly review both the synthesis and the analysis sparse
representations of signals, along with the threshold operation used as a sparse
coding approach (Section 3.1). We also review LAST (Section 3.2).

3.1. Sparse Representation of Signals 

Let $x \in \mathbb{R}^n$ be a signal vector and $D \in \mathbb{R}^{n \times N}$
be an overcomplete dictionary. The sparse representation problem corresponds to
finding the coefficient vector $z^* \in \mathbb{R}^N$ that minimizes the
$\ell_0$ norm

$$z^* = \arg\min_{z} \|z\|_0 \quad \text{s.t.} \quad x = Dz, \qquad (1)$$

where $\|\cdot\|_0$ measures the number of nonzero coefficients. Therefore, the
signal $x$ can be synthesized as a linear combination of $k$ nonzero columns of
the dictionary $D$, also called the synthesis operator. Solving (1) requires
testing all possible sparse vectors $z$, that is, all combinations of $N$
entries taken $k$ at a time. This is an NP-hard problem, but an approximate
solution can be obtained by using the $\ell_1$ norm instead of the $\ell_0$
norm, i.e.

$$z^* = \arg\min_{z} \|z\|_1 \quad \text{s.t.} \quad x = Dz, \qquad (2)$$

where $\|\cdot\|_1$ is the $\ell_1$ norm. The solution of (2) can be computed by
solving the problem of minimizing the $\ell_1$ norm of the coefficients among
all decompositions, which is convex and can be solved efficiently. If the
solution of (2) is sufficiently sparse, it will be equal to the solution of
(1) [?].

The sparse coding transform [?] is another way of sparsifying a signal, where
the dictionary is a linear transform that maps the signal to a sparse
representation. For example, signals formed by the superposition of sinusoids
have a dense representation in the time domain and a sparse representation in
the frequency domain. For this type of signal, the Fourier transform is the
sparse coding transform. Quite simply, $D^T x = z$ is the sparse transform of
$x$, where $z$ is the sparse coefficient vector. In general, the transform $D$
can be a well-structured fixed basis such as the DFT, or it can be learned
specifically for the target problem represented in the training dataset. A
learned dictionary can be an overcomplete dictionary learned from the signal
dataset, as in [3], a square invertible dictionary, as in [3], or even a
dictionary without restrictions on the number of atoms, as in LAST [2].

When a signal is corrupted by additive white Gaussian noise (AWGN), its
transform results in a coefficient vector that is not sparse. A common way of
making it sparse is to apply a threshold operation to its entries right after
the transform, where the entries lower than the specified threshold are set to
zero. Among the existing threshold operators, the soft-threshold is the one
that, in addition to the threshold operation, subtracts the threshold from the
remaining values, shrinking them toward zero [?].

Let $z = [z_1, \ldots, z_N]$ be the coefficients of a sparse representation of
a signal corrupted by AWGN given by

$$z_i = s_i + \epsilon \zeta_i, \quad i = 1, \ldots, N, \qquad (3)$$

where $\zeta_i$ is the noise, i.i.d. as $\mathcal{N}(0, 1)$, $\epsilon > 0$ is
the noise level, and $s_i$ are the coefficients of the sparse representation of
the pure signal.

Because the $s_i$ coefficients in (3) are sparse, there exists a threshold
$\alpha$ that can separate most of the pure signal $s_i$ from the noise using
the soft-thresholding operator [?]

$$h_\alpha(z) = \mathrm{sgn}(z) \max(0, |z| - \alpha), \qquad (4)$$

where $\mathrm{sgn}(\cdot)$ is the sign function. For classification tasks, the
best estimate of $\alpha$ can be computed using the training set.
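
As an illustration of the operator in (4), the following is a minimal sketch in
plain Python. The function names and the sample coefficient values are
assumptions made for this example only; they do not come from the original
work.

```python
def soft_threshold(z, alpha):
    """Soft-thresholding of a scalar: sgn(z) * max(0, |z| - alpha)."""
    if z > alpha:
        return z - alpha      # positive entry shrunk toward zero
    if z < -alpha:
        return z + alpha      # negative entry shrunk toward zero
    return 0.0                # entries with |z| <= alpha are zeroed


def soft_threshold_vector(coeffs, alpha):
    """Apply the operator entry-wise to a list of coefficients."""
    return [soft_threshold(z, alpha) for z in coeffs]


# Hypothetical noisy coefficients: two large "signal" entries and three
# small "noise" entries.
noisy = [3.0, -0.2, 0.1, -2.5, 0.3]
denoised = soft_threshold_vector(noisy, alpha=0.5)
```

With these illustrative values, the three small entries are set to zero and the
two large ones are reduced in magnitude by the threshold.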

3.2. Learning Algorithm for Soft-Thresholding Classifier (LAST) 

LAST [2] is an algorithm based on a learned transform followed by a
soft-threshold, as described in Section 3.1. Differently from the original
soft-threshold map (4), LAST uses a soft-threshold version that also sets all
negative values to zero, i.e., $h_\alpha(z) = \max(0, z - \alpha)$, where
$\alpha$ is the threshold, also called the sparsity parameter. When
$\alpha = 0$, this threshold operator can be seen as the ReLU activation
function, which has produced good results in deep neural network architectures
[?]. We chose LAST as our case study because of the simplicity of its learning
process for the sparsifying dictionary and the classifier hyperplane.

For the training cases $X = [x_1 | \ldots | x_m] \in \mathbb{R}^{n \times m}$
with labels $y = [y_1 | \ldots | y_m] \in \{-1, 1\}^m$, the sparsifying
dictionary $D \in \mathbb{R}^{n \times N}$ that contains $N$ atoms and the
classifier hyperplane $w \in \mathbb{R}^N$ are estimated using the supervised
optimization

$$\min_{D, w} \sum_{i=1}^{m} H\left(y_i w^T h_\alpha(D^T x_i)\right) + \frac{\nu}{2} \|w\|_2^2, \qquad (5)$$

where $H$ is the hinge loss function $H(x) = \max(0, 1 - x)$ and $\nu$ is the
regularization parameter that prevents the overfitting of the classifier $w$ to
the training set. At test time, the classification of each test case $x$ is
performed by first extracting the sparse features from the signal $x$, using

$$f = \max(0, D^T x - \alpha), \qquad (6)$$

and then by classifying these features using $c = w^T f > 0$, where $c$ is the
class returned by the classifier. We direct the reader to [5] for a deeper
understanding of LAST.
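
The test-time computation in (6) followed by the hyperplane test can be
sketched as below. This is only an illustration under stated assumptions: the
dictionary is stored as a list of atoms (the columns of D), the function names
are hypothetical, and the toy values of D and w are made up rather than taken
from a trained LAST model.

```python
def last_features(D, x, alpha):
    """Compute f = max(0, D^T x - alpha), one feature per atom.

    D is a list of N atoms (columns of the dictionary), each with n values.
    """
    return [max(0.0, sum(d * v for d, v in zip(atom, x)) - alpha)
            for atom in D]


def last_classify(D, w, x, alpha):
    """Return +1 if w^T f > 0 and -1 otherwise."""
    f = last_features(D, x, alpha)
    score = sum(wk * fk for wk, fk in zip(w, f))
    return 1 if score > 0 else -1


# Toy model with N = 3 atoms in R^2 (illustrative values only).
D = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = [1.0, -1.0, 0.25]

label_a = last_classify(D, w, [2.0, 0.5], alpha=0.5)
label_b = last_classify(D, w, [0.5, 2.0], alpha=0.5)
```

The whole test-time cost is one matrix-vector product, one thresholding pass,
and one inner product, which is what makes the multiplications the natural
target for the cost reductions discussed in this paper.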


4. Proposed Techniques 

In this section, we introduce a set of techniques for simplifying the
test-time computations of classifiers based on learned transforms and
soft-threshold. We start by describing in Section 4.1 the image datasets to
which we apply the proposed techniques for validation. Next, we present in
Section 4.2 our main theoretical findings supporting xQuant, whose techniques
are finally presented in Section 4.3.

4.1. Datasets for Training and Validation

The first two datasets contain patches extracted from the textures presented
in Figure 1, which belong to the Brodatz dataset [22]. We built the datasets
using the following methodology. First, we separate each image in half and then
use the left half to create the 500 training patches and the right half to
create the 500 test patches. These patches are subsets of each image containing
12 x 12 pixels. Next, for each patch, we stack its 12 columns and then
normalize the resulting vector to have $\ell_2$ norm equal to 1. As in [2], the
first task consisted of discriminating test patches from the images bark and
woodgrain, and the second task consisted of discriminating patches from the
images pigskin and pressedcl. For future reference, we named the first task
bark_woodgrain and the second task pigskin_pressedcl.



(a) bark (b) woodgrain (c) pigskin (d) pressedcl 

Figure 1: Textures we used to generate the first two binary datasets. 


The third binary dataset was built using a subset of the CIFAR-10 image
dataset [23]. This dataset contains 60 000 tiny 32 x 32 RGB images in 10
classes, with 50 000 images in the training set and 10 000 in the test set.
Each image has 3 color channels and is stored in a vector of
32 x 32 x 3 = 3 072 positions. The dataset we used was the subset formed by the
images labeled as deer or horse.

The first multiclass dataset was the MNIST dataset [24], which contains
70 000 images of handwritten digits of size 28 x 28, distributed in 60 000
images in the training set and 10 000 images in the test set. As in [2], all
images have zero mean and $\ell_2$ norm equal to 1.

The last task consisted of the classification of all 10 classes from the
CIFAR-10 image dataset.

4.2. Theoretical Results on Computational Cost Reduction

For the purpose of brevity, we coined the term powerize to concisely describe
the operation of approximating each value from a set of values to its
respective closest power of 2.

Theorem 1. The relative distance $R(x)$ between any real scalar $x$ and its
powerized version $P_2(x)$, defined by $R(x) = |P_2(x) - x| / x$, is upper
bounded by 1/3.

Proof. Let $2^n \le x \le 2^{n+1}$, $n \in \mathbb{Z}$, and let
$d_{P_2}(x) = |P_2(x) - x|$ be the distance between $x$ and its powerized
version. The distance $d_{P_2}(x)$ is maximum when
$x = x_m = \frac{1}{2}(2^{n+1} + 2^n) = 2^{n-1} \cdot 3$, which is the middle
point between the two closest powers of 2.

Therefore, the distance is
$d_{P_2}(x_m) = |x_m - 2^n| = |2^{n-1} \cdot 3 - 2^n| = |2^{n-1}(3 - 2)| = |2^{n-1}| = 2^{n-1}$,
and so the maximum relative distance between $x$ and its powerized version is
$R(x_m) = d_{P_2}(x_m) / x_m = 2^{n-1} / (3 \cdot 2^{n-1})$, which is equal to
1/3. □
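
The bound of Theorem 1 can be checked numerically with a short sketch. The
code below is an illustration, not the authors' implementation: the function
names are hypothetical, the nearest power of 2 is measured in linear distance
(as in the definition of the relative distance), and the tie at the midpoint is
resolved toward the lower power, which is an arbitrary choice made here.

```python
import math


def powerize(x):
    """Return the power of 2 closest to x (sign preserved, 0 stays 0)."""
    if x == 0:
        return 0.0
    mag = abs(x)
    _, exp = math.frexp(mag)              # mag = m * 2**exp with 0.5 <= m < 1
    lower = math.ldexp(1.0, exp - 1)      # largest power of 2 <= mag
    upper = 2.0 * lower                   # next power of 2
    # Ties (the midpoint 1.5 * lower) go to the lower power by assumption.
    nearest = lower if (mag - lower) <= (upper - mag) else upper
    return math.copysign(nearest, x)


def relative_distance(x):
    """Relative distance |P2(x) - x| / |x| between x and its powerized value."""
    return abs(powerize(x) - x) / abs(x)


# Sweep a grid of positive values and record the worst relative distance.
grid = [0.1 + 0.01 * k for k in range(2000)]
worst = max(relative_distance(x) for x in grid)
```

The worst case is attained at midpoints such as x = 3 or x = 6, where the
relative distance equals 1/3.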

We now show how the classification accuracy on the test set is influenced by
small variations introduced in the entries of the model $(D, w)$. Using the
datasets bark_woodgrain and pigskin_pressedcl described in Section 4.1, we
trained an initial model $(D, w)$ with 50 atoms and created 50 versions
$(D, w)^i$, $i = 1, 2, \ldots, 50$, using the following steps. Each model
$(D, w)^i$ was built by multiplying the entries of the initial model $(D, w)$
by a random value chosen from the uniform distribution on the open interval
$(1 - d_i, 1 + d_i)$, where $d_i \in \{0.02, 0.04, 0.06, \ldots, 1\}$. Next, we
evaluated all models on the test set.

To get a better estimate of the classification accuracy of each model, we
performed the above steps ten times on different initial models $(D, w)$
trained using different initial values. The results, shown in Figure 2,
indicate a clear trade-off between the classification accuracy and how far the
entries of $(D, w)^i$ are displaced from the corresponding entries of the
original models $(D, w)$.

Hypothesis 1. The model $(D, w)$ can be powerized at the cost of a slight
classification accuracy decrease.

Figure 2: Impact on the classification accuracy when the entries of the
dictionary D and classifier w are randomly modified up to a certain level d.

It is worth noting that Theorem 1 guarantees an upper bound of 1/3 for the
relative distance between any real scalar $x$ and its powerized version.
Therefore, it is reasonable to hypothesize that the classification accuracy
using the powerized pair $(D, w)_{power}$ is no worse than using $(D, w)^i$
when $d_i = 1/3$, shown in Figure 2. To support this hypothesis, we performed
another simulation with the datasets bark_woodgrain and pigskin_pressedcl. For
each dataset, we trained 10 models $(D, w)^i$ on different random versions of
the training set and evaluated them and their respective powerized versions
$(D, w)^i_{power}$ on the test set. Regarding the bark_woodgrain dataset, the
original model accuracy was 97.33% (0.93) and the powerized model accuracy was
97.00% (1.06). As for pigskin_pressedcl, the original model accuracy was
84.00% (1.61) and the powerized model accuracy was 82.65% (1.26).
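
To make concrete why powerized entries remove multiplications, the sketch
below computes an integer inner product using only shifts, sign changes, and
additions. It is a simplified illustration with hypothetical function names,
and it assumes that every weight is an integer of the form +/- 2^e with e >= 0.
Fractional powers of 2 would first need a common rescaling by a power of 2,
which multiplies the features by a positive constant and therefore requires the
threshold to be rescaled by the same factor.

```python
def to_shift_form(weights):
    """Convert integer power-of-2 weights into (sign, exponent) pairs.

    A zero weight is encoded as sign 0 and contributes nothing.
    """
    pairs = []
    for wgt in weights:
        if wgt == 0:
            pairs.append((0, 0))
            continue
        mag = abs(wgt)
        exp = mag.bit_length() - 1
        if (1 << exp) != mag:
            raise ValueError("weight is not a power of 2")
        pairs.append((1 if wgt > 0 else -1, exp))
    return pairs


def shift_dot(x_int, pairs):
    """Inner product of integers with power-of-2 weights, without multiplying."""
    acc = 0
    for x, (sign, exp) in zip(x_int, pairs):
        if sign > 0:
            acc += x << exp
        elif sign < 0:
            acc -= x << exp
    return acc


weights = [4, -2, 0, 1, -8]          # illustrative powerized entries
samples = [3, 5, 7, -2, 1]           # illustrative integer inputs
result = shift_dot(samples, to_shift_form(weights))
```

The result matches the ordinary inner product of the same vectors, while each
term costs one shift and one addition or subtraction.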

Theorem 2. Let $X_{int}$ be a training set formed by integer-valued vectors and
$X$ be its normalized version with $\ell_2$ norm equal to 1, on which the model
$(D, w)$ is trained. The classification accuracies of the raw signals
$X_{int}$ and of the normalized signals $X$ are exactly the same when the
sparsity parameter $\alpha$ in (6) is $\alpha = \|x_{int}\|_2$ for each
$x_{int} \in X_{int}$.

Proof. Let $x_{int}$ and $x$ be, respectively, a raw vector from the test set
and its normalized version, with $\|x\|_2 = 1$. Let also $(D, w)$ be the model
trained with $\alpha = 1$. Therefore, the extracted features are
$f = D^T x = D^T x_{int} / \|x_{int}\|_2$, and the soft-thresholded feature is

$$f_\alpha = \max(0, f - \alpha) = \max\left(0, \frac{D^T x_{int}}{\|x_{int}\|_2} - 1\right) = \frac{1}{\|x_{int}\|_2} \max\left(0, D^T x_{int} - \|x_{int}\|_2\right).$$

Finally, the classification of $x_{int}$ is

$$c = \left(w^T \frac{1}{\|x_{int}\|_2} \max\left(0, D^T x_{int} - \|x_{int}\|_2\right) > 0\right).$$

As the $\ell_2$ norm of any real vector different from the null vector is
always greater than 0, then $1 / \|x_{int}\|_2 > 0$, and thus
$c = (w^T \max(0, D^T x_{int} - \|x_{int}\|_2) > 0)$.

Therefore, as $x = x_{int} / \|x_{int}\|_2$, the expressions
$c = (w^T \max(0, D^T x - \alpha) > 0)$, with $\alpha = 1$, and
$c = (w^T \max(0, D^T x_{int} - \alpha) > 0)$, with $\alpha = \|x_{int}\|_2$,
are equivalent. □
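
The equivalence stated in Theorem 2 can be checked on toy data with the sketch
below. The model values and function names are assumptions made for this
example; the only property exercised is that classifying the raw integer vector
with a threshold equal to its norm gives the same decision as classifying the
unit-norm vector with a threshold of 1.

```python
import math


def classify(D, w, x, alpha):
    """Decision (w^T max(0, D^T x - alpha) > 0), with D as a list of atoms."""
    f = [max(0.0, sum(d * v for d, v in zip(atom, x)) - alpha) for atom in D]
    return sum(wk * fk for wk, fk in zip(w, f)) > 0


def classify_raw(D, w, x_int):
    """Classify the integer vector directly, with alpha = ||x_int||_2."""
    norm = math.sqrt(sum(v * v for v in x_int))
    return classify(D, w, x_int, alpha=norm)


def classify_normalized(D, w, x_int):
    """Classify the unit-norm version of x_int, with alpha = 1."""
    norm = math.sqrt(sum(v * v for v in x_int))
    return classify(D, w, [v / norm for v in x_int], alpha=1.0)


# Toy model and integer inputs (illustrative values, not trained).
D = [[2.0, 0.0, 1.0], [0.0, 3.0, -1.0], [1.0, 1.0, 1.0]]
w = [1.0, -0.5, 0.25]
inputs = [[200, 10, 30], [5, 120, 10], [90, 90, 90], [1, 250, 3]]
agree = all(classify_raw(D, w, x) == classify_normalized(D, w, x)
            for x in inputs)
```

Note that the raw path still needs the norm of each input to set the threshold,
but it no longer needs to divide every input entry by that norm.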

Empirical evidence 1. Forcing the dictionary $D$ to be sparse by
hard-thresholding its entries up to a certain level will decrease its dynamic
range and thus reduce the number of bits necessary to compute $D^T X$, at the
cost of a slight classification accuracy decrease.

We hypothesized that forcing $D$ to be sparse would decrease its dynamic
range with no substantial decrease of its classification accuracy. To support
our hypothesis, we performed another simulation with the datasets
bark_woodgrain and pigskin_pressedcl. For each dataset, we trained a model
$(D, w)$ and created 14 versions of it by hard-thresholding the entries of $D$
using 14 threshold values linearly spaced between 0 and 4. Subsequently, we
divided each element of the hard-thresholded dictionary $D_t$ by the lowest
value of $|D_t|$ that is different from 0.
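
The two steps just described (hard-thresholding, then division by the smallest
nonzero magnitude) can be sketched as follows. This is an illustration only:
the function names are hypothetical, entries with absolute value strictly below
the threshold are zeroed (the treatment of equality is an assumption), and the
bit count is a rough proxy for the dynamic range that ignores the sign bit.

```python
def hard_threshold(D, t):
    """Zero out every entry of D whose absolute value is below t."""
    return [[v if abs(v) >= t else 0.0 for v in row] for row in D]


def rescale_by_smallest(D):
    """Divide all entries by the smallest nonzero absolute value in D."""
    smallest = min(abs(v) for row in D for v in row if v != 0)
    return [[v / smallest for v in row] for row in D]


def magnitude_bits(D):
    """Bits needed for the largest magnitude after rounding to integers."""
    largest = max(abs(round(v)) for row in D for v in row)
    return max(1, largest.bit_length())


# Illustrative dictionary with a few very small entries.
D = [[0.01, -2.0, 0.5], [4.0, -0.02, 1.0]]
bits_before = magnitude_bits(rescale_by_smallest(D))
bits_after = magnitude_bits(rescale_by_smallest(hard_threshold(D, 0.25)))
```

In this toy case, removing the two smallest entries raises the smallest nonzero
magnitude from 0.01 to 0.5, so the ratio between the largest and smallest
entries, and with it the number of bits, drops sharply.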

Finally, we evaluated all resulting models on the test set. For a better
estimate of the classification accuracy, we performed the above steps on 10
models $(D, w)$ trained on different random versions of the training set and
computed their average. As shown in Figure 3(a), the first threshold different
from zero already reduces the bit precision of $D_t$ to less than half of the
original while slightly decreasing its classification accuracy. Also, the third
threshold different from 0 shown in Figure 3(b) almost maintains the same
classification accuracy while reducing its dynamic range to less than half of
the original.

Figure 3: Impact on the classification accuracy when hard threshold is used to
reduce the bit precision of dictionary D. The values shown are the average of
the classification accuracy on the test set evaluated with 10 models (D, w),
with 50 atoms, trained with different training sets. The original results are
marked with a red circle. The datasets are described in Section 4.1.

Empirical evidence 2. Quantizing the integer-valued images from the test set
$X_{int}$ up to a certain level will decrease the dynamic range of $X_{int}$
and thus reduce the number of bits necessary to compute $D^T X_{int}$, at the
cost of a slight classification accuracy decrease.

We also hypothesized that the original integer-valued signals were quantized
with more levels than necessary and that their quantization level could be
decreased without substantially worsening the classification accuracy. To
support our hypothesis, we performed another simulation with the datasets
bark_woodgrain and pigskin_pressedcl. For each dataset, we averaged the results
of one thousand runs consisting of 10 models $(D, w)$ trained using different
training sets and evaluated on different quantized versions of the test set.
The images from each test set $X_{int}$ were quantized using levels ranging
from 1 to 15. The results are shown in Figure 4. It is worth noting in this
figure that images from both datasets can have their bit precision reduced to 2
(quantization levels equal to 2 and 3) with a limited decrease of the
classification accuracy.
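
One possible reading of the quantization step is sketched below. It is an
assumption, not the procedure confirmed by the text: here a quantization level
q is taken to mean that 8-bit pixel values in 0..255 are mapped by rounding
onto the integers 0..q, which is consistent with levels 2 and 3 both requiring
2 bits. The function names and the 8-bit input range are hypothetical.

```python
def quantize_image(x_int, level, max_value=255):
    """Map integers in [0, max_value] onto the integers 0..level by rounding.

    Uses integer arithmetic only: floor((2*v*level + max_value) / (2*max_value))
    equals v * level / max_value rounded to the nearest integer.
    """
    return [(2 * v * level + max_value) // (2 * max_value) for v in x_int]


def bits_for_level(level):
    """Bits needed to store an unsigned integer in 0..level."""
    return max(1, level.bit_length())


pixels = [0, 42, 43, 128, 255]                  # illustrative 8-bit samples
coarse = quantize_image(pixels, level=3)
```

Under this reading, a level of 1 needs a single bit per pixel, levels 2 and 3
need 2 bits, and a level of 15 needs 4 bits.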
Figure 4: Impact on the classification accuracy when the images of the test set
are quantized up to a certain level. The original results are marked with a red
circle. Note that reducing the bit precision of the test set images to as low
as 2 bits does not substantially worsen the classification accuracy. These
results are the average of the classification results of the test set evaluated
with 10 models (D, w), with 50 atoms, trained with different training sets. The
datasets are described in Section 4.1.
4.3. Proposed Techniques

Technique 1. Use signals in their raw representation (in integer) rather than
their normalized version (in floating-point).

Technique 2. Powerize D and w.

Technique 3. Decrease the dynamic range of the test set by quantizing the
integer-valued test images $X_{int}$.

Technique 4. Decrease the dynamic range of the entries of D by penalizing
their $\ell_2$ norm during training, followed by hard-thresholding using a
trained threshold level.

Our strategy to decrease the dynamic range of the dictionary $D$ involves the
addition of a penalty on the $\ell_2$ norm of its entries during the
minimization of the objective function of LAST, described in (5). The
motivation for penalizing the $\ell_2$ norm of $w$ and $D$ is that this can
avoid solutions containing high-valued entries, which would require a
representation using more bits. Also note that penalizing the $\ell_1$ norm,
which would seem more reasonable in terms of providing sparse dictionaries,
would still allow for higher entries (even if in small numbers), which would
anyway require more bits for proper quantization. The new proposed optimization
problem hence becomes

$$\min_{D, w} \sum_{i=1}^{m} H\left(y_i w^T h_\alpha(D^T x_i)\right) + \frac{\nu}{2} \|w\|_2^2 + \frac{\kappa}{2} \|D\|_2^2, \qquad (7)$$

where $\kappa$ controls this new penalization. In Section 4.4, we show our
proposed technique for including this penalization in general constrained
optimization algorithms, followed by how we included it in the difference of
convex (DC) optimization algorithm used in LAST [2].

After training $D$ and $w$ using the modified objective function (7), we
apply a hard-threshold to the entries of $D$ to zero out the values closest to
zero. Our assumption is that these small values of $D$ contribute little to the
final feature value and thus can be set to zero without affecting the
classification accuracy much. As for the threshold value, we search for the
best one among all unique absolute values of $D$ after it has been powerized
using Technique 2. As the number of unique absolute values of $D$ is
substantially reduced after applying Technique 2, the computational burden of
testing all possible values is greatly reduced.

4.4. Inclusion of an $\ell_2$ Norm Penalization Term in Dictionary Training
Algorithms Based on Constrained Optimization

We show how to include a term in the objective function that penalizes
potential dictionaries whose entries have larger energy values, as opposed to
lower-energy dictionaries. By favoring vectors with lower energies, we may
obtain dictionaries that span narrower ranges of values. In our development, we
consider the inclusion of this penalization in gradient descent (GD) methods,
as many optimization problems are based on GD [?]. In our experimental
evaluations, we test the proposed methods by modifying the algorithm in [5],
which uses GD to solve the optimization problem. The development in this
section applies both to our modifications in [2] and to other methods based on
GD.

Several dictionary and classifier training methods are based on constrained
optimization programs such as [?]

$$\min_{V, w} f(V, w) \quad \text{s.t.} \quad g(V, w) = 0, \qquad (8)$$

where: (i) $V$ is an $n_1 \times 1$ vector containing the dictionary terms and
$w$ is an $n_2 \times 1$ vector of classifier parameters; (ii)
$f : \mathbb{R}^n \to \mathbb{R}$, $n = n_1 + n_2$, is the cost function based
on the training set; (iii) $0$ is the null vector; and (iv)
$g : \mathbb{R}^n \to \mathbb{R}^m$ is a function representing $m$ scalar
equality constraints. Some methods also include inequality constraints.

In order to penalize the total energy associated with the dictionary entries,
we can replace any problem of the form (8) by

$$\min_{V, w} f(V, w) + \kappa \frac{1}{2} \|V\|_2^2 \quad \text{s.t.} \quad g(V, w) = 0, \qquad (9)$$

where $\kappa > 0$ is a penalization weight.

Iterative methods are commonly used to solve constrained optimization problems
[?] such as (8). They start with an initial value
$x^{(0)} = [V^{(0)} \; w^{(0)}]^T$ for $x = [V \; w]^T$, which is iterated to
generate a supposedly convergent sequence $x^{(n)}$ satisfying

$$x^{(n+1)} = x^{(n)} + \eta \Delta x^{(n)}, \quad \forall n \ge 0, \qquad (10)$$

where $\eta$ is the step size and
$\Delta x^{(n)} = [\Delta V^{(n)} \; \Delta w^{(n)}]$ is the step computed based
on the particular iterative method.

We consider the GD method, where computing $\Delta x^{(n)}$ requires
evaluating the gradient of a dual function associated with the objective
function and the constraints [?]. Specifically, the Lagrangian
$L(V, w, \lambda)$ is an example of a dual function, thus having a local
maximum that is a minimum of the objective function at a point that satisfies
the constraints. For problems (8) and (9), the Lagrangian functions are given
respectively by

$$L(V, w, \lambda) = f(V, w) + \lambda^T g(V, w) \quad \text{and} \qquad (11)$$

$$\tilde{L}(V, w, \lambda) = f(V, w) + \lambda^T g(V, w) + \kappa \frac{1}{2} \|V\|_2^2, \qquad (12)$$

with $\lambda$ the vector of $m$ Lagrange multipliers.

Our first objective regarding solving the modified problem (9) is to compute
the gradient of its Lagrangian $\tilde{L}(V, w, \lambda)$, given in (12), in
terms of the gradient of the Lagrangian $L(V, w, \lambda)$ of the original
problem, given in (11), so as to show how a method that solves (8) can be
modified in order to solve (9). By comparing (11) and (12), and by defining
$\nabla_v g$ as the gradient of any function $g$ with respect to a vector $v$,
we obtain

$$\nabla_V \tilde{L}(V, w, \lambda) = \nabla_V L(V, w, \lambda) + \kappa V, \qquad (13)$$

$$\nabla_w \tilde{L}(V, w, \lambda) = \nabla_w L(V, w, \lambda), \quad \text{and} \qquad (14)$$

$$\nabla_\lambda \tilde{L}(V, w, \lambda) = \nabla_\lambda L(V, w, \lambda). \qquad (15)$$

Equations (13), (14), and (15) show how we modify the estimated gradient
in any GD method (such as LAST [2]) in order to penalize the range of the
dictionary entries, and thus try to force a solution with a narrower range.
Note that only the gradient with respect to the dictionary is altered.
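
The practical consequence of (13) and (14) is that an existing gradient-based
update only needs one extra term for the dictionary. The sketch below
illustrates this with a plain descent step on the primal variables; the
function names are hypothetical, the update of the Lagrange multipliers is
omitted, and the gradients of the original Lagrangian are assumed to be
supplied by the unmodified training method.

```python
def penalized_gradient(grad_V, V, kappa):
    """Dictionary gradient of the penalized problem: original gradient + kappa * V."""
    return [g + kappa * v for g, v in zip(grad_V, V)]


def gd_step(V, w, grad_V, grad_w, kappa, step):
    """One descent step; only the dictionary gradient receives the extra term."""
    g_V = penalized_gradient(grad_V, V, kappa)
    V_new = [v - step * g for v, g in zip(V, g_V)]
    w_new = [wi - step * g for wi, g in zip(w, grad_w)]   # classifier gradient unchanged
    return V_new, w_new


# With a zero data gradient, the penalty alone shrinks the dictionary entries.
V_next, w_next = gd_step(V=[4.0, -2.0], w=[1.0],
                         grad_V=[0.0, 0.0], grad_w=[0.5],
                         kappa=0.5, step=0.5)
```

In this toy step, the dictionary entries move toward zero in proportion to
their magnitude, which is the mechanism that favors a narrower range of values.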


5. Simulations 


In this section, we evaluate how our techniques affect the accuracy of LAST
on the same datasets used in [2]. For this, we performed many simulations
using the datasets presented in Section 4.1 and compared their classification
accuracy/error and classification bit precision, that is, the minimum number of
bits necessary to perform the classification. We present in Section 5.1 the
parameters we chose to generate these models, and the analysis of the results
we obtained comes in Section 5.3.


5.1. Choice of Classifier Parameters 

For all tested datasets, we fixed the parameter k = {4, 8 ,10,..., 20} * 10“^ 
and let z_threshold assume all unique values of the powerized dictionary
$D_{power}$, i.e., after applying Technique 2. As the number of unique values of
$D_{power}$ is substantially lower than that of $D$, the computational burden of
testing all valid thresholds is low. Also, we fixed the quantization parameter
quanta = {1, 2, ..., 10} U {31, 127}. The choice of these parameter values was
empirically based on a previous run of all simulations. As for the parameters
in LAST, we used the same ones used in [2]. We direct the reader to [2] for
further understanding of the parameters and their values used in LAST.

5.2. Model Seleetion 

Due to the large number of parameter combinations of both Technique]^ and 
Technique our simulations generate many different models with classification 
accuracy/error and classification bit precision. To select the best model, that is,
the best combination of the parameters k, z_threshold, and quanta, we relied
on the classification accuracy on a separate data set. We also introduced the
parameter γ to control the trade-off between the classification accuracy and the
classification bit precision. We used γ = 0.001 and the following steps for
model selection: (i) First, we used 80% of the training set to train the models
(D and w) and the remaining 20% to estimate the best combination of
the parameters k, z_threshold, and quanta. (ii) Let M be the set of models
trained with all combinations of the parameters k, z_threshold, and quanta.
Also, let R = M(X) be the set of classification results on the training set
X using the models M, and let best_acc be the best training accuracy in R.
(iii) From M, we create the subset M_γ that contains the models with results
R_γ = R[accuracy ≥ (1 − γ) best_acc]. (iv) From M_γ, we create a new subset
M_bits with results R_bits = R_γ[number of bits == lowest_num_bits], where
lowest_num_bits is the lowest number of bits necessary for the computation
of D^T X. (v) From M_bits, we finally choose the model M_best such that the result
R_best = R_bits[sparsest representation of X].
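
Steps (ii) to (v) amount to three successive filters over the set of results. The Python sketch below illustrates them under stated assumptions: each result is a dictionary with the hypothetical keys "name", "accuracy", "num_bits", and "nonzeros" (a lower count of nonzero coefficients standing in for a sparser representation of X), and the four example results are invented for illustration.

```python
def select_model(results, gamma=0.001):
    """Select one result following steps (ii)-(v) of the model selection.

    The keys 'accuracy', 'num_bits' and 'nonzeros' are hypothetical names
    chosen for this sketch.
    """
    # (ii) best accuracy among all trained models
    best_acc = max(r["accuracy"] for r in results)
    # (iii) keep the models whose accuracy is at least (1 - gamma) * best_acc
    r_gamma = [r for r in results if r["accuracy"] >= (1 - gamma) * best_acc]
    # (iv) among those, keep the models that need the fewest bits
    lowest_num_bits = min(r["num_bits"] for r in r_gamma)
    r_bits = [r for r in r_gamma if r["num_bits"] == lowest_num_bits]
    # (v) finally, choose the model with the sparsest representation
    return min(r_bits, key=lambda r: r["nonzeros"])

# Invented example results, one per parameter combination.
results = [
    {"name": "A", "accuracy": 0.9500, "num_bits": 40, "nonzeros": 120},
    {"name": "B", "accuracy": 0.9495, "num_bits": 34, "nonzeros": 90},
    {"name": "C", "accuracy": 0.9493, "num_bits": 34, "nonzeros": 70},
    {"name": "D", "accuracy": 0.9300, "num_bits": 30, "nonzeros": 50},
]

best = select_model(results)
```

In this example, model D needs the fewest bits but is discarded in step (iii) because its accuracy falls below (1 − γ) best_acc; models B and C tie on the number of bits, and C is chosen for its sparser representation.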

It is worth noting that the traditional rule of thumb of using 2/3 of the
dataset to train and 1/3 to test is a safe way of estimating the true classification
accuracy when the classification accuracy on the whole dataset is higher
than 85% pS]. Nevertheless, as we are solely reserving part of the training set 
for the selection of the best parameter values, and not for the estimation of
the true classification accuracy, we opted for the more conservative proportion 
of 80% to train our models. This has the advantage of lowering the chance of 
missing an underrepresented training set sample. Moreover, the last step in our 
model selection algorithm selects the model that produces the sparsest signal 
representation, as it leads to models that generalize better m- 

5.3. Results and Analyses

In this section, the original results are those obtained by classifying the
test set with the model built by the original LAST algorithm. Conversely,
the proposed results are those obtained by classifying the test set
with the best model M_best built for each dataset, that is, the model
selected using the methodology presented in Section 5.2.

We show the results of our simulations on the binary tasks in Figure 5. As
shown at the bottom of Figures 5(a), 5(b), and 5(c), our techniques do not
substantially decrease the original classification accuracy. At the same time,
our techniques considerably reduce the number of bits necessary to perform the
multiplication D^T X, as shown at the top of Figures 5(a), 5(b), and 5(c).

Note that the original results in Figures 5(a) and 5(c) are lower than the
ones presented in [2]. Unlike that work, we used completely disjoint
training and test sets (with no overlap) to allow a better estimation of the true
classification accuracy.

Table 1 contains the results of the simulations on the MNIST and
CIFAR-10 tasks. The original results we obtained for both large datasets have a
slightly higher classification error than the ones reported in [2]. We hypothesize
that this is caused by the random nature of LAST for larger datasets, where
each GD step is optimized for a small portion of the data called a mini-batch,
which is randomly sampled from the training set. Moreover, we trained D and w
using 4/5 of the training set used in [2], which may negatively affect the
generalization power of the dictionary and classifier.

Note that our techniques resulted in an increase of the classification error 
on both MNIST and CIFAR-10 tasks. Nevertheless, our techniques reduced 
the number of bits necessary to run the classification at test time. Again, this 
dynamic range reduction is highly valuable for applications on FPGA. 
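
To put the trade-off reported in Table 1 in relative terms, the short Python sketch below computes the percentage reduction in the number of bits and the increase in classification error (in percentage points) for each dataset. The numbers are copied from Table 1; the dictionary keys and the function name are hypothetical.

```python
# Values copied from Table 1 (error in %, bits needed to compute D^T X).
table = {
    "MNIST":    {"orig_err": 1.71,  "prop_err": 2.23,  "orig_bits": 61, "prop_bits": 34},
    "CIFAR-10": {"orig_err": 46.27, "prop_err": 49.92, "orig_bits": 55, "prop_bits": 37},
}

def summarize(row):
    """Return (bit reduction in %, error increase in percentage points)."""
    bit_reduction = 100.0 * (row["orig_bits"] - row["prop_bits"]) / row["orig_bits"]
    error_increase = row["prop_err"] - row["orig_err"]
    return round(bit_reduction, 1), round(error_increase, 2)

summary = {name: summarize(row) for name, row in table.items()}
for name, (bits, err) in summary.items():
    print(f"{name}: {bits}% fewer bits, +{err} points of error")
```

On MNIST the number of bits drops by about 44% for an error increase of 0.52 percentage points, while on CIFAR-10 it drops by about 33% for an increase of 3.65 percentage points.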
Figure 5: Comparison of the results using the original LAST algorithm and our proposed
techniques. Regarding the classification at test time, these figures show for each dataset the
trade-off between the necessary number of bits (top) and the classification accuracy (bottom).
Our approach reduces the necessary number of bits to almost half of the original formulation at
the cost of a slight classification accuracy decrease. The datasets are described in Section 4.1.


The results presented in this section indicate that it is feasible to use integer
operations in place of floating-point ones, and bit shifts in place of
multiplications, with a slight decrease in classification accuracy. These substitutions
reduce the computational cost of classification at test time in FPGAs, which
is important in embedded applications, where power consumption is critical.
Moreover, our techniques reduce by almost half the number of bits necessary
to perform the most expensive operation in the classification, the matrix-vector
multiplication D^X. This was a result of the application of both Technique]^ 
and Technique]^ 
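
To make the replacement of multiplications by bit shifts concrete, the Python sketch below computes the inner product between one dictionary atom and an integer signal using only shifts, additions, and subtractions. It assumes that each powerized entry is stored as a (sign, exponent) pair standing for sign × 2^exponent, with sign = 0 marking a zeroed entry, and that all exponents are non-negative; this storage format, the function name, and the example values are illustrative assumptions rather than a description of our implementation.

```python
def shift_dot(atom, x):
    """Inner product of a powerized atom with an integer signal.

    atom: list of (sign, exponent) pairs, each standing for sign * 2**exponent
          (assumed format; sign is -1, 0 or +1, exponent is a non-negative int).
    x:    list of integers of the same length.
    Only shifts, additions and subtractions are used.
    """
    acc = 0
    for (sign, exponent), sample in zip(atom, x):
        if sign > 0:
            acc += sample << exponent   # sample * 2**exponent
        elif sign < 0:
            acc -= sample << exponent   # -(sample * 2**exponent)
        # sign == 0: entry removed by hard-thresholding, nothing to add
    return acc

# Hypothetical atom with entries [4, -1, 0, 8] and an integer signal.
atom = [(1, 2), (-1, 0), (0, 0), (1, 3)]
x = [5, 7, 100, -2]

result = shift_dot(atom, x)
reference = 4 * 5 + (-1) * 7 + 0 * 100 + 8 * (-2)
```

Entries with magnitude below 1 have negative exponents. One possible way to keep the arithmetic in integers, assumed here and not verified against our implementation, is to add a common offset to all exponents and scale the threshold by the same power of two.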

Also, it is worth noting that our techniques were developed to reduce the
computational cost of the classification with an expected accuracy reduction
within acceptable limits. Nevertheless, the classification accuracies on the
bark_woodgrain dataset using our techniques substantially outperform the
accuracies using the original model, as shown in Figure 5(a) (bottom). These
higher accuracies were unexpected. Regarding the original models, we noted that
the classification accuracies on the training set were 100% when using dictionaries
with at least 50 atoms. These models were probably overfitted to the training
set, making them fail to generalize to new data. As our powerize technique
introduces a perturbation to the entries of both D and w, we hypothesize that
it reduced the overfitting of D and w to the training set and, consequently,
increased their generalization power on unseen data [28]. However, this needs
further investigation.

Table 1: Comparison between the original and the proposed results regarding the classification
error and number of bits necessary to compute the matrix-vector multiplication D^T X of the
sparse representation.

            MNIST                       CIFAR-10
            Error %   # bits D^T X      Error %   # bits D^T X
Original    1.71      61                46.27     55
Proposed    2.23      34                49.92     37

6. Conclusion 

This paper presented a set of techniques for reducing the computations
at test time of classifiers that are based on learned transform and
soft-threshold. In short, the techniques are: adjusting the threshold so that the
classifier can use signals represented as integers instead of their normalized
floating-point versions; reducing the multiplications to simple bit shifts by
approximating the entries of both the dictionary D and the classifier vector w by
the nearest power of 2; and increasing the sparsity of the dictionary D by applying
hard-thresholding to its entries. We ran simulations using the same datasets used
in the original paper that introduced LAST, and our results indicate that our
techniques substantially reduce the computational load at a small cost in
classification accuracy. Moreover, in one of the tested datasets there was a
substantial increase in the accuracy of the classifier. The proposed optimization
techniques are valuable in applications where power consumption is critical.